adversarial example
Eliminating Catastrophic Overfitting Via Abnormal Adversarial Examples Regularization
However, single-step adversarial training (SSAT) suffers from catastrophic overfitting (CO), a phenomenon that leads to a severely distorted classifier, leaving it vulnerable to multi-step adversarial attacks. In this work, we observe that some adversarial examples generated on the SSAT-trained network exhibit anomalous behaviour: although these training samples are produced by the inner maximization process, their associated loss decreases instead. We name these abnormal adversarial examples (AAEs).
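The criterion described above reduces to a per-sample loss comparison: an adversarial example is meant to raise the loss, so a perturbed sample whose loss falls below its clean loss is flagged as abnormal. A minimal sketch of that check, assuming per-sample clean and adversarial losses are already computed (the function name and values are illustrative, not from the paper):

```python
import numpy as np

def flag_abnormal_adversarial_examples(clean_loss, adv_loss):
    """Flag samples whose loss *decreases* after the inner maximization step.

    The inner maximization is supposed to increase the loss; a perturbed
    sample whose loss drops below the clean loss instead is treated as an
    abnormal adversarial example (AAE).
    """
    clean_loss = np.asarray(clean_loss, dtype=float)
    adv_loss = np.asarray(adv_loss, dtype=float)
    return adv_loss < clean_loss

# Hypothetical per-sample losses before/after a single-step (FGSM) perturbation:
clean = [0.9, 1.2, 0.4, 2.0]
adv = [1.5, 0.8, 0.6, 1.9]  # samples 1 and 3 became *easier*, so they are AAEs
mask = flag_abnormal_adversarial_examples(clean, adv)
print(mask.tolist())  # [False, True, False, True]
```

In a training loop the same mask would be computed batch-wise from the two forward passes SSAT already performs, so flagging AAEs adds no extra model evaluations.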
c1f0b856a35986348ab3414177266f75-Paper-Conference.pdf
Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force.
Supplementary Material for Understanding and Improving Ensemble Adversarial Defense
They are used to test the proposed enhancement approach iGAT. In general, ADP employs an ensemble by averaging. Adversarial examples are generated to compute the losses using the PGD attack. Our main theorem builds on a supporting Lemma 2.1. We start from the cross-entropy loss curvature measured by Eq. The above new expression of T(x) helps bound the difference between h(x) and h(x). Note that these three cases are mutually exclusive.